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Abstract 

Optimization of very expensive black-box functions requires utilization of maximum information gath¬ 
ered by the process of optimization. Model Guided Sampling Optimization (MGSO) forms a more robust 
alternative to Jones’ Gaussian-process-based EGO algorithm. Instead of EGO’s maximizing expected im¬ 
provement, the MGSO uses sampling the probability of improvement which is shown to be helpful against 
trapping in local minima. Eurther, the MGSO can reach close-to-optimum solutions faster than standard 
optimization algorithms on low dimensional or smooth problems. 
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1 INTRODUCTION 


Optimization of expensive empirical functions forms an important topic in many engineering or natural- 
sciences areas. For such functions, it is often impossible to obtain any derivatives or information about 
smoothness; moreover, there is no mathematical expression nor an algorithm to evaluate. Instead, some 
simulation or experiment has to be performed, and the value obtained through a simulation or experiment 
is the value of the objective function being considered. Such functions are also called black-box functions. 
They are usually very expensive to evaluate; one evaluation may cost a lot of time and money to process. 

Because of the absence of derivatives, standard continuous first- or second-order derivative optimiza¬ 
tion methods cannot be used. In addition, the functions of this kind are usually characterized by a high 
number of local optima where simple algorithms can be trapped easily. Therefore, different derivative-free 
optimization methods (often called meta-heuristics) have been proposed. Even though these methods are 
rather slow and sometimes also computationally intensive, the cost of the empirical function evaluations is 
always much higher and the cost of these evaluations dominates the computational cost of the optimization 
algorithm. For this reason, it is crucial to decrease the number of function evaluations as much as possible. 

Evolutionary algorithms constitute a broad family of meta-heuristics that are frequently used for black¬ 
box optimization. Furthermore, some additional algorithms and techniques have been designed to minimize 
the number of objective function evaluations. All of the three following approaches use a model (of a 
different type in each case), which is built and updated within the optimization process. 

Estimation of distribution algorithms (EDAs) [Larranaga and Lozano, 200^ represent one such ap¬ 
proach: EDAs iteratively estimate the distribution of selected solutions (usually solutions with fitness above 
some threshold) and sample this distribution forming a new set of solutions for the next iteration. 

Surrogate modelling is the technique of learning and usage of a regression model of the objective func¬ 
tion |Jin, 2005) . The model (called surrogate model in this context) is then used to evaluate some of the 
candidate solutions instead of the original costly objective function. 

Our method. Model Guided Sampling Optimization (MGSO), takes inspiration from both these ap¬ 
proaches. It uses a regression model of the objective function, which also provides an error estimate. 
However, instead of replacing the objective function with this model, it combines its prediction and the 
error estimate to get a probability of reachiim a better solution in a given point. Similarly to EDAs, the 
MGSO then samples this pseudo-distributiorl4 in order to obtain the next set of solution candidates. 

The MGSO is also similar to Jones’ Efficient Global Optimization (EGO) [Jones et ah, 1998| . Like 
EGO, the MGSO uses a Gaussian process (GP, see [Rasmussen and Williams, 2006] for reference), which 
provides a guide where to sample new candidate solutions in order to explore new areas and exploit promis¬ 
ing parts of the objective-function landscape. Where EGO selects a single or very few solutions maximizing 
a chosen criterion - Expected Improvement (El) or Probability of Improvement (Pol) - the MGSO samples 
the latter criterion. At the same time, the GP serves for the MGSO as a surrogate model of the objective 
function for a small proportion of the solutions. 

This paper further develops the MGSO algorithm introduced in [Bajer et al., 2013[ . It brings several 
improvements (new optimization procedures and more general covariance function) and performance com¬ 
parison to EGO. The following section introduces methods used in the MGSO, Section describes the 
MGSO algorithm, and Sectionl^comprises some experimental results on several functions from the BBOB 
testing set IHansen et al., 200^ 

2 GAUSSIAN PROCESSES 

Gaussian process is a probabilistic model based on Gaussian distributions. This type of model was chosen 
because it predicts the function value in a new point in the form of univariate Gaussian distribution: the 
mean and the standard deviation of the function value are provided. Through the predicted mean, the GP 
can serve as a surrogate model, and the standard deviation is an estimate of the prediction uncertainty in a 
new point. 

'a function proportional to a probability distribution, it’s value is given by the probability of improvement 
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The GP is specified by mean and covariance functions and a relatively small number of covariance 
function’s hyper-parameters. The hyper-parameters are usually fitted by the maximum-likelihood method. 

Let Xn = {xi I X,- e j be a set of N training D-dimensional data points with known dependent- 

variable values y/\i = and /(x) be an unknown function being modelled for which /(x,) = y; for 

all i G {1,...,A^}. The GP model imposes a probabilistic model on the data: the vector of known function 
values y!\i is considered to be a sample of a A^-dimensional multivariate Gaussian distribution with the value 
of the probability density p{yN \ Xn). If we take into consideration a new data point (xat+j ), the value 
of the probability density in the new point is 


p{yN+i IXat+i) 


exp(-|y^+iC^:i_iyjv+i) 

V(2Jt)^+idet(C^+i) 


( 1 ) 


where Cat+i is the covariance matrix of the Gaussian distribution (for which mean is usually set to constant 
zero) and yjv+i = (yi, ■ ■ • ,yN,yN+l) (see |Buche et ah, 2005) for details). This covariance can be written 
as 



where Cff is the covariance of the Gaussian distribution given the N training data points, k is the vector of 
covariances between the new point and training data, and K is the variance of the new point itself. 


Predictions. Predictions in Gaussian processes are made using Bayesian inference. Since the inverse 
j of the extended covariance matrix can be expressed using the inverse of the training covariance , 
and yi\i is known, the density of the distribution in a new point simplifies to a univariate Gaussian with the 
density 

p{yN+l |Xjv+i,yjv) oc exp (3) 

with the mean and variance given by 

.%+! = k^c^'yw, 

■’fyjv+i = K-k^C^'k 

Further details can be found in |Buche et ah, 20051 . 

Covariance functions. The covariance plays a crucial role in these equations. It is defined by the 
covariance-function matrix K and signal noise a as 

Civ = Kat + OlAf (6) 

where Ia^ is the identity matrix of order N. Gaussian processes use parametrized covariance functions K 
describing prior assumptions on the shape of the modelled function. The covariance between the function 
values at two data points x; and xj is given by K{xi,Xj), which forms the {i, y)-th element of the matrix Kff. 
We used the most common squared-exponential function 

K{xi,Xj)=Qexp(^-^{xi-Xj f{xi-Xj)^ , (7) 

which is suitable when the modelled function is rather smooth. The closer the points x,- and xj are, the 
closer the covariance function value is to 1 and the stronger correlation between the function values /(x,) 
and f{xj) is. The signal variance 9 scales this correlation, and the parameter £ is the characteristic length- 
scale with which the distance of two considered data points is compared. Our choice of the covariance 
function was motivated by its simplicity and the possibility of finding the hyper-parameter values by the 
maximum-likelihood method. 


(4) 

(5) 
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Figure 1: Probability of improvement. Rastrigin function in 2D, the GP model is built using 40 data points. 

3 MODEL GUIDED SAMPLING OPTIMIZATION (MGSO) 

The MGSO algorithm is based on a similar idea as EGO. It heavily relies on Gaussian process modelling, 
particularly on its regression capabilities and ability to assess model uncertainty in any point of the input 
space. 

While most variants of EGO calculate new points from the expected improvement (El), The MGSO 
utilizes the probability of improvement which is closer to the basic concept of the MGSO: sampling a 
distribution of promising solution^ 

This probability is taken as a function proportional to a probability density and is sampled producing 
a whole population of candidate solutions - individuals. This is the main difference to EGO which takes 
only very few individuals each iteration, usually the point maximizing El. 



3.1 Sampling 

The core step of the MGSO algorithm is the sampling of the probability of improvement. This probability 
is, for a chosen threshold T of the function value, directly given by the predicted mean /(x) = y and the 
standard deviation f(x) = Sy of the GP model / in any point x of the input space 

Polr(x) = P(/(x) ^ T) = <h > (8) 

which corresponds to the value of cumulative distribution function (CDF) of the Gaussian with density 0 
for the value of threshold T. As a threshold T, values near the so-far optimum (or the global optimum if 
known) are usually taken. 

Even though all the variables come from Gaussian distribution, Pol(x) is definitely not Gaussian-shaped 
since it depends on the threshold T and the black-box function being modelled /. A typical example of the 
landscape of Pol(x) in two dimensions for the Rastrigin function is depicted in Fig.[^ The dependency on 
the black-box function also causes the lack of analytical marginals or derivatives. 

3.2 MGSO Procedure 

The MGSO algorithm starts with an initial random sample from the input space forming an initial pop¬ 
ulation, which is evaluated with the black-box objective function (step (2) in Alg.[^. All the evaluated 
solutions are saved to a database from where they are used as a training set for the GP model. 

The main cycle of the MGSO starts viithfitting the GP model’s (M,) hyper-parameters based on the data 
from the current dataset S; (step (4)). Further, the model’s Pol^' is sampled (step (5)) and supplemented 
with the GP model’s minimum (step (6)), forming up to N new individuals {xj}^j where A is a parameter 
defining the standard population size. The algorithm follows up with the evaluation of the new individuals 

^some EGO variants use Pol, too [Jones, 2001| 
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Algorithm 1 MGSO (Model Guided Sampling Optimization) 

1: Input: / - black-box objective function 
N - standard population size 
r - the number of solutions for dataset restriction 
2: So = <— generate N initial samples and evaluate them by /: yj = /(xy) 

3: while no stopping condition is met, for i= 1,2,... do 

4: M, ■<— build a GP model and fit its hyper-parameters according to the dataset S,_i 

5: {xyl^j ■<— sample the Poll^' (x) forming N new points, optionally with different targets T 

6: Xnjin <— argminx/(x) - find the minimum of the GP (by local cont. optimization) and replace the nearest solution 

from {xyl^j with it 
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{evaluate the population) 

8 

S/^S/_iU{(x;,y;)}^^j 

(augment the dataset) 

9 

(XminDmin) argmin(x_3,)gs,A 

{find the best solution in S;) 

to 

if any rescale condition is met then 


11 

restrict the dataset S/ to the bounding-box 
and linearly transform S; to [—1,1]® 

l;,u,] of the r nearest solutions along the best solution (X[j,in,y„,in) 

12 

end if 
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end while 
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Output: the best found solution (Xnjin,y^[„) 



with the objective function (step (7)), extending the dataset of already-measured samples (step (8)) and 
finding the best so-far optimum (xn,in,yniin) (step (9)). 

Covariance matrix constraint. As every covariance matrix, Gaussian process’ covariance matrix 
is required to be positive semi-definite (PSD). This constraint is checked during sampling, and candidate 
solutions leading to close-to-indefinite matrix are rejected. Although it could cause smaller population- 
size in some iterations, it is an important step: otherwise, Gaussian process training and fitting becomes 
numerically very unstable. If such rejections arise, other two thresholds T for calculating Pol are tested and 
population with the greatest cardinality is taken. These rejections occur when the GP model is sufficiently 
trained and sampled solutions become close to the model’s predicted values. 

Model minimum. New population is supplemented with the minimum x^jin of the model’s predicted 
values found by continuous optimizatior0 (step (6), Xj^jn = argminx/(x)); more precisely, the nearest 
sampled solution is replaced with this minimum (unless less than N solutions were sampled). 

Input space restriction. In current implementation, MGSO requires bounds constraints x e [l,u] .1< 
u G to be defined on the input space, which is used by the algorithm to internal linear scaling of the 
input space to hypercube [—1,1]^. As the algorithm proceeds, the input space can be restricted along the 
so-far optimum to a smaller bounding box (steps (10)-(12)) which is again linearly scaled to [—1,1]^. The 
size of the new box is defined as a bounding box of r nearest samples from the current so-far optimum x^in; 
enlarged by 10% at the boundaries. For the parameter r= 15 • D was used as a rule of thumb. 

This process not only speeds up model fitting and prediction (due to the smaller number of training 
data), but focuses the optimization along the best found solution and broaden small regions of non-zero 
Pol. 

Several criteria are defined to launch such input space restriction, from which the most important is 
occurrence of numerous rejections in sampling due to close-to-indefinite covariance matrix. If the resulting 

^Matlab’s fminsearch was used 
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new bounding box from the restriction is close to the previous box (the coordinates are not smaller than 
[—0.8,0.8]^), the input space restriction is not performed. 

3.3 Gaussian Processes Implementation 

Our Matlab implementation of the MGSO makes use of Gaussian Process Regression and Classification 
Toolbox (GPML Matlab code) - a toolbox accompanying Rasmussen’s and Williams’ monograph (Rasmussen and Williams, 2006| . 
In the current version of the MGSO, Rasmussen’s optimization and model fitting procedure minimize was 
replaced with fmincon from Matlab Optimization toolbox and with Hansen’s Covariance Matrix Adap¬ 
tation (CMA-ES) (Hansen and Ostermeier, 2001) . These functions are used for the minimization of GP’s 
negative log-likelihood in model’s hyper-parameters fitting. Here, fmincon is generally faster, but CMA-ES 
is more robust and does not need a valid initial point. 

The next improvement lies in the employment of the diagonal-matrix characteristic length-scale pa¬ 
rameter in the squared exponential covariance function, sometimes also called covariance function with 
automatic relevance determination (ARD) 

0exp ^ (x; - diag(^) (x,- - xy) j . (9) 

The length-scale parameter in changes to diag(f) - a matrix with diagonal elements formed by the 
elements of the vector £ G R^. The application of ARD was not straightforward, because hyper-parameters 
training tends to stuck in local optima. These cases were indicated by an extreme difference between the 
different scale-length parameter £ components which resulted in poor regression capabilities. Therefore, 
constraints on maximum difference between components of £ were introduced 

£i < 2.5 ||median(T) — £/||. 

4 EXPERIMENTAL RESULTS 

This section comprises quantitative results from several tests of the MGSO as well as brief discussion of the 
usability of the algorithm. The current Matlab implementation of the MGSO algorithtiQhas been tested on 
three different benchmark functions from the BBOB testing set (Hansen et al., 200^ : sphere, Rosenbrock 
and Rastrigin function in three different dimensionalities: 2D, 5D and lOD. Eor these cases, comparison 
with CMA-ES - current state of the art black-box optimization algorithm - and Tomlab’s implementation 
of EGCj^is provided. 

The computational times are not quantified, but whereas CMA-ES performs in orders of tens of sec¬ 
onds, the running times of the MGSO and EGO reaches up to several hours. We consider this drawback 
acceptable since the primary use of the MGSO is expensive black box optimization where a single evalua¬ 
tion of the objective function can easily take many hours or even days and/or cost a considerable amount of 
money (Holena et ah, 200^ . 

In the case of two-dimensional problems, the MGSO performs far better than CMA-ES on quadratic 
sphere and Rosenbrock function. The results on Rastrigin function are comparable, although with greater 
variance (see Eig.|^ the descent of the medians is slightly slower within the first 200 function evaluations, 
but faster thereafter). The Tomlab’s implementation of EGO performs almost equally well as the MGSO 
on sphere function, but on Rosenbrock and Rastrigin, the convergence of EGO is extremely slowed down 
after few iterations, which can be seen in 5D and lOD, too. The positive effect of ARD covariance function 
can be seen quite clearly, especially on Rosenbrock function. The difference between ARD and non-ARD 

"*the source is available at http; //github. com/charypar/gpeda 
^http://tomopt.com/tomlab/products/cgo/solvers/ego.php 
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Figure 2: Medians, and the first and third quartiles of the distances from the best individual to optimum (/^ = /best — /opt) 
with respect to the number of objective function evaluations for three benchmark functions. Quartiles are computed out 
of 15 trials with different initial settings (instances 1-5 and 31-40 in the BBOB framework). 































































































results are hardly to see on sphere function, probably because its symmetry means no improvement in ARD 
covariance employment. 

The performance of the MGSO on five-dimensional problems is similar to 2D cases. The MGSO de¬ 
scends notably faster on non-rugged sphere and Rosenbrock functions, especially if we concentrate on 
depicted cases with a very low number of objective function evaluations (up to 250 • D evaluations). The 
drawbacks of the MGSO is shown on 5D Rastrigin function where it is outperformed by CMA-ES, espe¬ 
cially between ca. 200 and 1200 function evaluations. 

Results of optimization in the case of ten-dimensional problems show that the MGSO works better 
than CMA-ES only on the most smooth sphere function which is very easy to regress by Gaussian process 
model. On more complicated benchmarks, the MGSO is outperformed by CMA-ES. 

The graphs on Fig.j^show that the MGSO is usually slightly slower than EGO in the very first phases 
of the optimization, but EGO quickly stops its progress and does not descent further. This is exactly what 
can be expected from the MGSO in comparison to EGO - sampling the Pol instead of searching for the 
maximum can easily overcome situations where EGO gets stuck in a local optimum. 


5 CONCLUSIONS AND FUTURE WORK 

The MGSO, the optimization algorithm based on a Gaussian process model and the sampling of the prob¬ 
ability of improvement, is intended to be used in the field of expensive black-box optimization. This paper 
summarizes its properties and evaluates its performance on several benchmark problems. Comparison with 
Gaussian-process based EGO algorithm shows that the MGSO is able to easily escape from local optima. 
Further, it has been shown that the MGSO can outperform state-of-the-art continuous black-box optimiza¬ 
tion algorithm CMA-ES in low dimensionalities or on very smooth functions. On the other hand, CMA-ES 
performs better on rugged or high-dimensional benchmarks. 
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